{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "预处理\n",
    "===\n",
    "\n",
    "在你对数据集进行模型训练之前，需要将其预处理成预期的模型输入格式。无论你的数据是文本、图像还是音频，它们都需要被转换并组装成一批批的张量。Transformers提供了一组预处理类，帮助你为模型准备数据。在本教程中，你将了解到：\n",
    "\n",
    "* 文本，使用标记器(Tokenizer)将文本转换为一连串的标记，创建标记的数字表示，并将它们组装成张量。\n",
    "* 语音和音频，使用特征提取器(Feature extractor)从音频波形中提取顺序特征，并将其转换为张量。\n",
    "* 图像输入使用一个ImageProcessor将图像转换成张量。\n",
    "* 多模态输入，使用一个处理器来结合标记器和特征提取器或图像处理器。\n",
    "\n",
    "> AutoProcessor总能发挥作用，为你使用的模型自动选择正确的类，无论你使用的是标记器、图像处理器、特征提取器还是处理器。\n",
    "\n",
    "在你开始之前，请安装Datasets，这样你就可以加载一些数据集进行实验："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "pip install datasets"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 自然语言处理(Natural Language Processing)\n",
    "\n",
    "对文本数据进行预处理的主要工具是标记器。标记器根据一组规则将文本分割成标记。这些标记被转换成数字，然后被转换成张量，成为模型的输入。模型所需的任何额外输入都由标记器添加。\n",
    "\n",
    "> 如果你打算使用一个预训练的模型，那么使用相关的预训练的标记器是很重要的。这确保了文本的分割方式与预训练语料库相同，并在预训练过程中使用相同的对应标记-索引（通常被称为词汇）。\n",
    "\n",
    "通过AutoTokenizer.from_pretrained()方法加载一个预训练的标记符。这将下载一个模型预训练过的词汇："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from transformers import AutoTokenizer\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"bert-base-cased\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后将你的文本传递给标记器："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "encoded_input = tokenizer(\"Do not meddle in the affairs of wizards, for they are subtle and quick to anger.\")\n",
    "print(encoded_input)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102], \n",
    " 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], \n",
    " 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}\n",
    "```\n",
    "\n",
    "tokenizer返回一个有三个重要项目的字典：\n",
    "\n",
    "* input_ids是对应于句子中每个标记的索引。\n",
    "* attention_mask表示一个标记是否应该被关注。\n",
    "* token_type_ids在有多个序列的情况下确定一个标记属于哪个序列。\n",
    "\n",
    "通过对input_ids的解码，返回你的输入："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "tokenizer.decode(encoded_input[\"input_ids\"])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'\n",
    "```\n",
    "\n",
    "正如你所看到的，标记化器在该句中添加了两个特殊的标记-CLS和SEP（分类器和分离器）。并非所有的模型都需要特殊的标记，但如果它们需要，标记器会自动为你添加它们。\n",
    "\n",
    "如果有几个你想预处理的句子，把它们作为一个列表传给标记器："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "batch_sentences = [\n",
    "    \"But what about second breakfast?\",\n",
    "    \"Don't think he knows about second breakfast, Pip.\",\n",
    "    \"What about elevensies?\",\n",
    "]\n",
    "encoded_inputs = tokenizer(batch_sentences)\n",
    "print(encoded_inputs)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102], \n",
    "               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], \n",
    "               [101, 1327, 1164, 5450, 23434, 136, 102]], \n",
    " 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], \n",
    "                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], \n",
    "                    [0, 0, 0, 0, 0, 0, 0]], \n",
    " 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], \n",
    "                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], \n",
    "                    [1, 1, 1, 1, 1, 1, 1]]}\n",
    "```\n",
    "\n",
    "### 填充（Pad）\n",
    "\n",
    "句子并不总是相同的长度，这可能是一个问题，因为张量，即模型输入，需要有一个统一的形状。填充是一种策略，通过向较短的句子添加特殊的填充标记来确保张量是矩形的。\n",
    "\n",
    "将padding参数设置为True，以填充批次中较短的序列，以匹配最长的序列："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "batch_sentences = [\n",
    "    \"But what about second breakfast?\",\n",
    "    \"Don't think he knows about second breakfast, Pip.\",\n",
    "    \"What about elevensies?\",\n",
    "]\n",
    "encoded_input = tokenizer(batch_sentences, padding=True)\n",
    "print(encoded_input)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], \n",
    "               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], \n",
    "               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], \n",
    " 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], \n",
    "                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], \n",
    "                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], \n",
    " 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], \n",
    "                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], \n",
    "                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}\n",
    "```\n",
    "\n",
    "第一句和第三句现在用0来填充，因为它们比较短。\n",
    "\n",
    "### 截取（Truncation）\n",
    "\n",
    "另外一种情况，有时一个序列可能太长，模型无法处理。在这种情况下，你需要将序列截断成一个较短的长度。\n",
    "\n",
    "将参数truncation设置为True，以将序列截断到模型所能接受的最大长度："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "batch_sentences = [\n",
    "    \"But what about second breakfast?\",\n",
    "    \"Don't think he knows about second breakfast, Pip.\",\n",
    "    \"What about elevensies?\",\n",
    "]\n",
    "encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)\n",
    "print(encoded_input)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0], \n",
    "               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102], \n",
    "               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]], \n",
    " 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], \n",
    "                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], \n",
    "                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], \n",
    " 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], \n",
    "                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], \n",
    "                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}\n",
    "```\n",
    "\n",
    "> 查看[Padding and truncation](https://huggingface.co/docs/transformers/pad_truncation)的概念指南，了解更多不同的填充和截断参数。\n",
    "\n",
    "### 构建张量(Build tensors)\n",
    "\n",
    "最后，你想让标记器返回实际的张量，这些张量被送入模型。\n",
    "\n",
    "将return_tensors参数设置为PyTorch的pt，或TensorFlow的tf："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "batch_sentences = [\n",
    "    \"But what about second breakfast?\",\n",
    "    \"Don't think he knows about second breakfast, Pip.\",\n",
    "    \"What about elevensies?\",\n",
    "]\n",
    "encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors=\"pt\")\n",
    "print(encoded_input)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "{'input_ids': tensor([[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],\n",
    "                      [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],\n",
    "                      [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]]), \n",
    " 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
    "                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],\n",
    "                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), \n",
    " 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],\n",
    "                           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],\n",
    "                           [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 音频\n",
    "\n",
    "对于音频任务，你需要一个特征提取器来为模型准备你的数据集。特征提取器被设计用来从原始音频数据中提取特征，并将其转换为张量。\n",
    "\n",
    "加载MInDS-14数据集（参见数据集教程，了解如何加载数据集的更多细节），看看你如何用音频数据集使用特征提取器："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from datasets import load_dataset, Audio\n",
    "\n",
    "dataset = load_dataset(\"PolyAI/minds14\", name=\"en-US\", split=\"train\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "访问音频列的第一个元素，看一下输入的情况。 调用音频列会自动加载并重新采样音频文件："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "dataset[0][\"audio\"]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "{'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,\n",
    "         0.        ,  0.        ], dtype=float32),\n",
    " 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',\n",
    " 'sampling_rate': 8000}\n",
    "```\n",
    "\n",
    "这将返回三个项目：\n",
    "* array 是作为一个一维阵列加载的语音信号--并有可能重新采样。\n",
    "* path 指向音频文件的位置。\n",
    "* sampling_rate 是指每秒钟测量多少个语音信号的数据点。\n",
    "\n",
    "在本教程中，你将使用Wav2Vec2模型。看一下模型卡，你会知道Wav2Vec2是在16kHz采样的语音音频上进行预训练的。重要的是，你的音频数据的采样率与用于预训练模型的数据集的采样率相匹配。如果你的数据的采样率不一样，那么你需要重新采样你的数据。\n",
    "\n",
    "1. 使用Datasets的cast_column方法将采样率提高到16kHz："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "dataset = dataset.cast_column(\"audio\", Audio(sampling_rate=16_000))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. 再次调用音频列以重新取样音频文件："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "dataset[0][\"audio\"]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "{'array': array([ 2.3443763e-05,  2.1729663e-04,  2.2145823e-04, ...,\n",
    "         3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),\n",
    " 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',\n",
    " 'sampling_rate': 16000}\n",
    "```\n",
    "\n",
    "接下来，加载一个特征提取器，对输入进行规范化处理和填充。当对文本数据进行填充时，对于较短的序列会添加一个0。这个想法同样适用于音频数据。特征提取器默认在数组中添加0。\n",
    "\n",
    "用AutoFeatureExtractor.from_pretrained()加载特征提取器："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from transformers import AutoFeatureExtractor\n",
    "\n",
    "feature_extractor = AutoFeatureExtractor.from_pretrained(\"facebook/wav2vec2-base\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "将音频阵列传递给特征提取器。我们还建议在特征提取器中加入sampling_rate参数，以便更好地调试，避免可能出现的任何潜在错误。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "audio_input = [dataset[0][\"audio\"][\"array\"]]\n",
    "feature_extractor(audio_input, sampling_rate=16000)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "{'input_values': [array([ 3.8106556e-04,  2.7506407e-03,  2.8015103e-03, ...,\n",
    "        5.6335266e-04,  4.6588284e-06, -1.7142107e-04], dtype=float32)]}\n",
    "```\n",
    "\n",
    "就像标记器一样，你可以应用填充或截断来处理一个批次中的可变序列。看一下这两个音频样本的序列长度："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "dataset[0][\"audio\"][\"array\"].shape  # (173398,)\n",
    "\n",
    "dataset[1][\"audio\"][\"array\"].shape  # (106496,)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "创建一个函数来预处理数据集，使音频样本具有相同的长度。指定一个最大的样本长度，特征提取器将填充或截断序列以匹配它："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "def preprocess_function(examples):\n",
    "    audio_arrays = [x[\"array\"] for x in examples[\"audio\"]]\n",
    "    inputs = feature_extractor(\n",
    "        audio_arrays,\n",
    "        sampling_rate=16000,\n",
    "        padding=True,\n",
    "        max_length=100000,\n",
    "        truncation=True,\n",
    "    )\n",
    "    return inputs"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对数据集中的前几个例子应用preprocess_function："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "processed_dataset = preprocess_function(dataset[:5])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在，样本的长度是一样的，与指定的最大长度相匹配。你现在可以把你处理过的数据集传递给模型了!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "processed_dataset[\"input_values\"][0].shape  # (100000,)\n",
    "\n",
    "processed_dataset[\"input_values\"][1].shape  # (100000,)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 计算机视觉\n",
    "\n",
    "对于计算机视觉任务，你需要一个图像处理器来为模型准备你的数据集。图像预处理--将图像转换成模型所期望的输入--包括几个步骤。这些步骤包括但不限于调整大小、归一化、色彩通道校正和将图像转换为张量。\n",
    "\n",
    "> 图像预处理往往是在某种形式的图像增强之后进行的。图像预处理和图像增强都是对图像数据的转换，但它们的目的不同：\n",
    "> * 图像增强以一种有助于防止过度拟合和增加模型的稳健性的方式改变图像。你可以创造性地增加你的数据 - 调整亮度和颜色、裁剪、旋转、调整大小、缩放等。然而，要注意不要用你的扩增来改变图像的含义。\n",
    "> * 图像预处理保证了图像与模型的预期输入格式相符。在对计算机视觉模型进行微调时，必须对图像进行预处理，与最初训练模型时完全一样。\n",
    "> 你可以使用任何你喜欢的库来进行图像扩增。对于图像预处理，使用与该模型相关的ImageProcessor。\n",
    "\n",
    "加载food101数据集（参见[数据集](https://huggingface.co/docs/datasets/load_hub.html)教程，了解如何加载数据集的更多细节），看看如何用计算机视觉数据集使用图像处理器：\n",
    "\n",
    "> 使用数据集分割参数，只加载训练分割的小样本，因为数据集相当大!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "dataset = load_dataset(\"food101\", split=\"train[:100]\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来，用数据集图像功能看一下图像："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "dataset[0][\"image\"]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![图片](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vision-preprocess-tutorial.png)\n",
    "\n",
    "用AutoImageProcessor.from_pretrained()加载图像处理器："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from transformers import AutoImageProcessor\n",
    "\n",
    "image_processor = AutoImageProcessor.from_pretrained(\"google/vit-base-patch16-224\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "首先，让我们添加一些图像增强功能。你可以使用你喜欢的任何库，但在本教程中，我们将使用torchvision的transforms模块。如果你对使用另一个数据增强库感兴趣，可以在Albumentations或Kornia笔记本中学习如何使用。\n",
    "\n",
    "1. 在这里，我们使用Compose将几个变换连锁起来--RandomResizedCrop和ColorJitter。请注意，对于调整大小，我们可以从图像处理器中获得图像的尺寸要求。对于某些模型来说，需要精确的高度和宽度，对于其他模型来说，只需要定义最短边。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from torchvision.transforms import RandomResizedCrop, ColorJitter, Compose\n",
    "\n",
    "size = (\n",
    "    image_processor.size[\"shortest_edge\"]\n",
    "    if \"shortest_edge\" in image_processor.size\n",
    "    else (image_processor.size[\"height\"], image_processor.size[\"width\"])\n",
    ")\n",
    "\n",
    "_transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5)])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. 该模型接受 pixel_values 作为其输入。ImageProcessor可以负责对图像进行标准化处理，并生成适当的张量。创建一个函数，将图像增强和图像预处理结合起来，对一批图像进行处理，并生成pixel_values："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "def transforms(examples):\n",
    "    images = [_transforms(img.convert(\"RGB\")) for img in examples[\"image\"]]\n",
    "    examples[\"pixel_values\"] = image_processor(images, do_resize=False, return_tensors=\"pt\")[\"pixel_values\"]\n",
    "    return examples"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 在上面的例子中，我们设置do_resize=False，因为我们已经在图像增强转换中调整了图像的大小，并且利用了相应的图像处理器的大小属性。如果你在图像增强过程中不调整图像的大小，可以不设置这个参数。默认情况下，ImageProcessor将处理大小的调整。\n",
    "> 如果你希望将图像标准化作为扩增转换的一部分，请使用image_processor.image_mean和image_processor.image_std值。\n",
    "\n",
    "3. 然后使用Datasets set_transform来实时应用变换："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "dataset.set_transform(transforms)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4. 现在当你访问图像时，你会发现图像处理器已经添加了pixel_values。你现在可以把你处理过的数据集传给模型了!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "dataset[0].keys()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面是应用变换后的图像的样子。图像被随机裁剪，它的颜色属性也不同。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "img = dataset[0][\"pixel_values\"]\n",
    "plt.imshow(img.permute(1, 2, 0))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![图片](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/preprocessed_image.png)\n",
    "\n",
    "> 对于像物体检测、语义分割、实例分割和全景分割这样的任务，ImageProcessor提供后处理方法。这些方法将模型的原始输出转化为有意义的预测，如边界盒或分割图。\n",
    "\n",
    "### 填充(Pad)\n",
    "\n",
    "在某些情况下，例如在微调DETR时，该模型在训练时应用比例增强。这可能会导致图像在一个批次中出现不同的尺寸。你可以使用DetrImageProcessor.pad_and_create_pixel_mask()，并定义一个自定义的collate_fn，将图像批到一起。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "def collate_fn(batch):\n",
    "    pixel_values = [item[\"pixel_values\"] for item in batch]\n",
    "    encoding = image_processor.pad_and_create_pixel_mask(pixel_values, return_tensors=\"pt\")\n",
    "    labels = [item[\"labels\"] for item in batch]\n",
    "    batch = {}\n",
    "    batch[\"pixel_values\"] = encoding[\"pixel_values\"]\n",
    "    batch[\"pixel_mask\"] = encoding[\"pixel_mask\"]\n",
    "    batch[\"labels\"] = labels\n",
    "    return batch"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 多式联运(Multimodal)\n",
    "\n",
    "对于涉及多模态输入的任务，你需要一个处理器来为模型准备你的数据集。一个处理器将两个处理对象结合在一起，如标记器和特征提取器。\n",
    "\n",
    "加载LJ语音数据集（参见数据集教程，了解如何加载数据集的详情），看看如何使用处理器进行自动语音识别（ASR）："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "lj_speech = load_dataset(\"lj_speech\", split=\"train\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于ASR来说，你主要关注的是音频和文本，所以你可以删除其他栏目："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "lj_speech = lj_speech.map(remove_columns=[\"file\", \"id\", \"normalized_text\"])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在看一下音频和文本栏："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "lj_speech[0][\"audio\"]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "{'array': array([-7.3242188e-04, -7.6293945e-04, -6.4086914e-04, ...,\n",
    "         7.3242188e-04,  2.1362305e-04,  6.1035156e-05], dtype=float32),\n",
    " 'path': '/root/.cache/huggingface/datasets/downloads/extracted/917ece08c95cf0c4115e45294e3cd0dee724a1165b7fc11798369308a465bd26/LJSpeech-1.1/wavs/LJ001-0001.wav',\n",
    " 'sampling_rate': 22050}\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "lj_speech[0][\"text\"]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "'Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition'\n",
    "```\n",
    "\n",
    "请记住，你应该始终对你的音频数据集的采样率进行重新采样，以匹配用于预训练模型的数据集的采样率！"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "lj_speech = lj_speech.cast_column(\"audio\", Audio(sampling_rate=16_000))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "用AutoProcessor.from_pretrained()加载一个处理器："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from transformers import AutoProcessor\n",
    "\n",
    "processor = AutoProcessor.from_pretrained(\"facebook/wav2vec2-base-960h\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. 创建一个函数来处理数组中包含的音频数据到input_values，并将文本标记为标签。这些都是模型的输入："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "def prepare_dataset(example):\n",
    "    audio = example[\"audio\"]\n",
    "\n",
    "    example.update(processor(audio=audio[\"array\"], text=example[\"text\"], sampling_rate=16000))\n",
    "\n",
    "    return example"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. 对一个样本应用prepare_dataset函数："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "prepare_dataset(lj_speech[0])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "处理器现在已经添加了input_values和标签，而且采样率也已经正确降频到16kHz。你现在可以把你处理过的数据集传给模型了!\n",
    "\n",
    "---\n",
    "v4.30.0"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
